1.  Introduction

We present and analyze a strategy for processing joins on a highly parallel computer
architecture. We model the architecture as consisting of n identical processor-memory
clusters interconnected by a symmetric network onto which many processors may send at
once. The strategy entails partitioning each relation horizontally based on a perfect hashing
function applied to a key of the relation.

The basic join algorithm consists of projecting each relation on the joining and result
columns and then sending each truncated tuple to the j'th processor if the hash function
applied to the join columns of the tuple yields j. Processor j then performs a local join,
producing the result.

We consider three variations on the basic algorithm for the case where the join columns
of at least one relation do not include a key (so there will be duplicate values): combining,
tagging, and smearing.

Combining is a network operation, whereby network switches filter out some of the duplicate
data destined for the same processor. As one might expect, this helps when there are many
duplicates.

Tagging changes the basic algorithm by having the originating processor project on the join
columns only (not the result columns) on one of the relations, then send each truncated tuple
to some destination processor. The destination processor sends this tuple back if it
determines that the tuple's join column values are matched by some tuple in the other
relation. We show that this improves performance when there are far fewer distinct join
column values than join and result column values.

Smearing changes the basic algorithm by copying the tuples of one relation, say S, from each
processor, say i, to several neighboring processors. This allows tuples from the other relation,
say R, that would normally hash to i to hash to any of the neighbors of i.

Our analysis depends on the properties of a particular interconnection scheme (the
omega network); our approach may be more generally applicable.

2.  Related Work

Our work is related to research in four areas: distributed query processing, especially
[CH82, GS82]; semi-join-based processing, particularly [BC81, B79]; database machines,
particularly [F79, KTM84]; and parallel algorithms [BDPV83, V83]. General query
optimization strategies for complex queries [GoodSimu82, JK84, Schm79] will be relevant to
later stages of our work.

Because we assume a network whose speed is comparable to local memory access time
and we assume that this network can be shared, our cost assumptions differ from those made
in much of the research on distributed query processing. (Our assumptions are based on the
environment to be provided by the New York University ultracomputer.) Moreover, our
design is based on partitioning relations across all processors. In contrast, most distributed
query processing research is concerned with minimizing the volume of communicated data in
the absence of partitioning [BGWRR81, ES80, HY78, HY79, WY76]. Chu and Hurley
[CH82] search for conditions that optimize both communication and processing costs. They
prove several theorems with strong intuitive appeal. For example, they show that in a
multiprocessing setting consisting of coequal processors and memories, performing unary
operations (such as selection and projection) before joins reduces total costs. Unfortunately,
their work implicitly assumes no partitioning, so we have not been able to use their stronger
results. Gavish and Segev [GS82] study the problem of minimizing communication costs for
intra-relation queries such as intersection and set difference for horizontally partitioned
relations. Because the queries are intra-relational, the implementation of the queries consists
of combining different partitions at one site. This makes each partition play the role of a
relation, again limiting the relevance of their work to ours.

Much of the early research on database machines was most successful in optimizing
selection and projection by employing special purpose input/output controllers that could
perform these functions as data passed through them [S79, SL73, 082]. Recently, various
tree structured interconnection topologies have been proposed [GS81, SZ84] to minimize the
time to communicate between processors. Research has begun on a dataflow approach as
well [BD80]. In addition, several architectures have used specialized processors to perform
the various functions such as projection, selection, sort, and join [B079, D81, MK81]. These
architectures put a premium on the locality of data for joins, because their interconnection
networks make arbitrary permutations of data expensive. But this limits the extent to which
more processors actually help, since it is always possible for a join to require a lot of
communication no matter what filtering technique one uses. Our strategy is to use a network
which offers fast communication and then to use hash partitioning to distribute the processing
cost as much as possible. For many cases, this gives nearly optimal speedup (i.e. speedup
proportional to the number of processors with a log factor of degradation). In our use of
hashing, our work is similar to the work of Kitsuregawa, Tanaka, and Moto-oka [KTM83,
KTM84]. The primary environmental difference is that we use a sharable network instead of
a pipelined ring network. We also analyze improvements to our basic algorithm for cases
when it may not work so well.

3.  Basic Algorithm

Our strategy is based on partitioning each relation by hashing on a key (called the
partition key) of that relation. In the next section, we show that this distributes the tuples of
each relation almost evenly.

3.1.  Notation

In the general case, we have a join of two relations R and S on join columns C (by
renaming we can assume a natural join), projected on attributes A from one or both of
these relations. The tuples of R at processor i are labelled Ri, and similarly for S. Projection
of one or more tuples Ti on attributes B is denoted Ti[B]. Projection removes duplicates.
Each hash function h is a function from the values under consideration to the numbers
between 0 and n-1, where n is the number of processors. Rtag is a tag indicating that the
accompanying tuple comes from the relation R, and similarly for Stag.

3.2.  Basic Algorithm


Basic  Algorithm 

(1)  At  each  processor  i, 

        Ri' := Ri[C U A]
        Si' := Si[C U A]

(2)  for each t ∈ Ri', send (t,Rtag) to h(t[C]).
     for each t ∈ Si', send (t,Stag) to h(t[C]).

(3)  At each processor j,

        join the incoming tuples from R and S and
        produce a join result.

This basic algorithm incorporates several possibilities. For example, if C is the
partition key of R, then none of R's tuples need to be sent over the network. Also, if none
of the attributes of R are in the result of the join, then this algorithm only sends the
projection Ri[C] across the network, performing a semi-join [B79, BC81] in effect.
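The three steps above can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `basic_parallel_join`, the representation of tuples as Python dicts, and the use of Python's built-in `hash` in place of the hash function h are all hypothetical choices made for the example.

```python
from collections import defaultdict

def basic_parallel_join(R_parts, S_parts, join_cols, r_out, s_out, n):
    """Simulate the basic algorithm on n logical processors.

    R_parts[i] / S_parts[i] hold the tuples (dicts) stored at processor i.
    r_out / s_out are the result columns (A minus C) taken from R and S.
    """
    def h(key):
        # Stand-in for the hash function h (an assumption of this sketch).
        return hash(key) % n

    inbox_R, inbox_S = defaultdict(list), defaultdict(list)

    # Steps (1) and (2): project on C U A (removing local duplicates),
    # then send each truncated tuple to processor h(t[C]).
    for parts, cols, inbox in ((R_parts, r_out, inbox_R), (S_parts, s_out, inbox_S)):
        for part in parts:
            projected = {(tuple(t[c] for c in join_cols),
                          tuple(t[c] for c in cols)) for t in part}
            for key, rest in projected:
                inbox[h(key)].append((key, rest))

    # Step (3): each destination processor joins what it received.
    result = set()
    for j in range(n):
        s_by_key = defaultdict(list)
        for key, rest in inbox_S[j]:
            s_by_key[key].append(rest)
        for key, rest in inbox_R[j]:
            for s_rest in s_by_key.get(key, ()):
                result.add(key + rest + s_rest)
    return result
```

For instance, joining two small partitioned relations on column `c` yields the single matching tuple regardless of which processor the value hashes to.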

3.3.  Processing Costs

We have written the algorithm as if the join only starts after the communication is
complete. Actually, we can do some processing during communication. Suppose $R_{dest}$ is the
set of tuples from R reaching some destination processor and let $S_{dest}$ be the tuples from S
reaching that processor. We can join the two relation subsets in time $O(|R_{dest}| + |S_{dest}|)$ by
preprocessing the relations during the communication step. Here is how.

When a tuple from $R_{dest}$ (respectively, $S_{dest}$) comes to the processor, insert it into a
B+-tree [Schk82] tagged R (respectively, S) based on its C values. After all tuples have
entered, intersect the C-values of the two B+-trees. This takes time
$O(|R_{dest}[C]| + |S_{dest}[C]|)$, because the C-values are in sorted order at the leaves of the
B+-trees. To format the output, take the cross product of the $R_{dest}$ and $S_{dest}$ tuples
associated with each C value in the intersection.

Without pipelining in this way, the processing cost would be
$O((|R_{dest}| + |S_{dest}|) \log(|R_{dest}| + |S_{dest}|))$. In a single processor, the time would be
$O((|R| + |S|) \log(|R| + |S|))$.¹
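The intersection step described in this subsection can be illustrated with a short sketch. It is an assumption-laden stand-in rather than the paper's data structure: Python dictionaries plus a final sort replace the two B+-trees, so the sort cost is paid after arrival instead of being overlapped with communication. The name `local_join` and the `(c_value, rest)` tuple layout are hypothetical.

```python
from collections import defaultdict

def local_join(r_dest, s_dest):
    """Join the tuples that reached one destination processor.

    r_dest and s_dest are iterables of (c_value, rest) pairs.
    """
    r_index, s_index = defaultdict(list), defaultdict(list)
    for c, rest in r_dest:
        r_index[c].append(rest)
    for c, rest in s_dest:
        s_index[c].append(rest)

    # Sorted C-values play the role of the B+-tree leaves.
    r_keys, s_keys = sorted(r_index), sorted(s_index)

    out, i, j = [], 0, 0
    # Linear merge of the two sorted key lists: O(|R_dest[C]| + |S_dest[C]|).
    while i < len(r_keys) and j < len(s_keys):
        if r_keys[i] < s_keys[j]:
            i += 1
        elif r_keys[i] > s_keys[j]:
            j += 1
        else:
            c = r_keys[i]
            # Cross product of the R and S tuples sharing this C value.
            out.extend((c, a, b) for a in r_index[c] for b in s_index[c])
            i += 1
            j += 1
    return out
```

The merge loop touches each distinct C value once; only the output formatting can exceed linear time, which is the situation the footnote describes when C is not a superkey of either relation.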

4.  Basic Probability Results

The performance of this algorithm and its variants depends on how hashing maps m
distinct values to 0,...,n-1. Since we are always concerned with bounding the maximum of
this distribution from above, we use the following two results.

Equidistribution Lemma: Suppose $f$ is a function from a set $M$ with cardinality $m$ to
$\{0, \ldots, n-1\}$, such that the probability that $f(x) = i$, for $0 \le i \le n-1$, is $1/n$.

(1) If $m \ge a\,n \log n$, where $a > 1$ is a constant, then
$\max_i |\{x \mid x \in M \text{ and } f(x) = i\}| \le 2m/n$ with probability [...].

(2) If $n \le m < n \log n$, then
$\max_i |\{x \mid x \in M \text{ and } f(x) = i\}| \le \log m$ with probability bounded below by [...].

(3) If $m \le n$, then
$\max_i |\{x \mid x \in M \text{ and } f(x) = i\}| \le \log m + 1$ with probability bounded below by
$1 - (m/n)^{[\ldots]}$.


¹ If C is not a superkey of either relation, the complexity could increase to $O(|R_{dest}| \cdot |S_{dest}|)$. In that
case, the complexity of joining the two relations in one processor is $O(|R| \cdot |S|)$.


This lemma tells us that with high probability, if M is large, it will be equally
distributed within a factor of 2; and if M is small, no more than log m + 1 values from M will
map to a single value.

Pigeoning Lemma: Suppose $f$ is a function from a set $M$ with cardinality $m$ to
$\{0, \ldots, n-1\}$, such that the probability that $f(x) = i$, for $0 \le i \le n-1$, is $1/n$. If
$m \ge b\,n \log n$ for $b > 2$, then, with probability at least $1 - n^{1-b}$, for every $i$ there is
an $x \in M$ such that $f(x) = i$. []

5.  Partitioning Based on a Key

We assume that each relation is partitioned horizontally among the processor-memory
clusters by means of a perfect (see [CW77, G81]) hashing function, from a key to
{0, 1, 2, ..., n-1}. This has the effect that the probability that a tuple r ∈ R will be
assigned to any particular processor is 1/n.

For  large  relations,  the  equidistribution  lemma  promises  us  an  equal  distribution  within 
a  factor  of  2. 

Example 1: for n = 1000 and |R| = 100000, the probability that every processor has fewer
than 200 tuples is greater than $1 - 10^{-[\ldots]}$.
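Example 1 can be checked empirically with a small simulation. This is only a sketch: it assumes that the perfect hashing function behaves like a uniform random assignment of tuples to processors, and the helper name `max_load` and the fixed seed are choices made for the illustration.

```python
import random
from collections import Counter

def max_load(m, n, seed=0):
    """Assign m tuples uniformly at random to n processors and
    return the largest number of tuples landing on one processor."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    loads = Counter(rng.randrange(n) for _ in range(m))
    return max(loads.values())

# Setting of Example 1: n = 1000 processors, |R| = 100000 tuples.
# The average load is 100, so a factor-of-2 bound means fewer than 200.
print(max_load(100_000, 1_000))
```

In runs of this kind the maximum stays well below the factor-of-2 bound of 200, which agrees with the equidistribution lemma for large relations.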

6.  Network Assumptions

We assume a packet-switched network, interconnecting n processors, in which a message
can travel from one processor to any other in $\log_2 n$ time when the network is unloaded.
Omega-style (also known as banyan-style) networks [GL73, KS83] realize this delay. See
figure 1. Moreover, all processors may send a packet at one time.

Whereas our techniques are applicable to any network of this type, our analysis depends
on certain specific properties of banyan networks. In a banyan network, there is one path
through the network switches from any processor to any other (figure 1). Consider a path
from processor i to processor j. The switch nearest j is numbered 1, the switch feeding this
switch is numbered 2, and so on up to the switch nearest i, which is numbered $\log_2 n$ (n is a
power of 2). According to this numbering, a partial path from i up to a switch at level r may
feed $2^r$ processors. This will be important in our analysis of the combining technique.
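The unique-path property and the $2^r$ fan-out can be illustrated with a short routing sketch. It assumes the common textbook formulation of an omega network (a perfect shuffle followed by a 2-by-2 exchange at every stage, with destination-tag routing); the paper does not spell out its wiring, so the functions `omega_path` and `destinations_fed` are hypothetical illustrations rather than the authors' model.

```python
def omega_path(src, dst, n):
    """Line addresses occupied after each of the log2(n) stages when
    routing from src to dst by destination-tag routing."""
    k = n.bit_length() - 1
    assert 1 << k == n, "n must be a power of 2"
    addr, path = src, []
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit address left by one position.
        addr = ((addr << 1) | (addr >> (k - 1))) & (n - 1)
        # Exchange: the switch sets the low bit to the next destination bit.
        bit = (dst >> (k - 1 - stage)) & 1
        addr = (addr & ~1) | bit
        path.append(addr)
    return path

def destinations_fed(src, dst, r, n):
    """Destinations reachable from the level-r switch (counted from the
    destination end) on the path from src to dst."""
    k = n.bit_length() - 1
    prefix = omega_path(src, dst, n)[:k - r]
    return [d for d in range(n) if omega_path(src, d, n)[:k - r] == prefix]

# With n = 8, the level-2 switch on the path from 5 to 2 feeds 4 processors.
print(destinations_fed(5, 2, 2, 8))
```

Because each stage fixes one more bit of the destination address, the switch at level r still has r bits undecided, which is where the $2^r$ count comes from in this formulation.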

7.  Communication over a Banyan Network

The  communication  cost  is  the  sum  of  the  cost  of  sending  one  relation  at  a  time.  This 
reduces  the  number  of  cases  we  consider,  but  is  pessimistic  (by  no  more  than  a  factor  of  2)  in 
that  we  can  in  fact  send  both  relations  concurrently,  if  both  must  be  sent. 

We use a sending protocol in which a processor that has to send a set of tuples will send
with probability [...] in any given cycle. There are two cases. In the first case, the join
columns constitute a key of the relation to be sent. This results in a nearly uniform load on
the intermediate switches, giving us the following lemma.

Network Lemma 1 (equiprobable case): If each processor i sends $out_i$ tuples and each
processor j receives $in_j$ tuples in a communication step using the above protocol, and each
tuple is equally likely to go to any destination processor, then the time the communication
step takes is $O(\log n + outmax + inmax)$, where outmax is the maximum number of tuples
leaving a processor and inmax is the maximum number of tuples entering a processor. For
relations whose sizes exceed $n \log n$, this becomes $O(outmax + inmax)$. []

In  the  general  case,  C  does  not  constitute  a  key,  so  the  performance  might  degrade  at 
each  of  the  log  n  switch  levels. 


Network Lemma 2 (general case): If each processor i sends $out_i$ tuples and each
processor j receives $in_j$ tuples in a communication step using the above protocol, then the
time the communication step takes is $O(\log n \cdot (outmax + inmax))$, where outmax is the
maximum number of tuples leaving a processor and inmax is the maximum number of tuples
entering a processor. []

Example 2: Suppose we join two large relations, R and S, on C and C is an alternate
key (or superkey) of both relations. Then $outmax = inmax = (2|R|/n) + (2|S|/n)$, so the
communication time is $O((|R| + |S|)/n)$. The additional processing time is $O((|R| + |S|)/n)$,
assuming we can pipeline the construction of the B+-trees (see subsection above on basic
algorithm). This is $O(n)$ speedup over computing on a single machine.

Example 3: Suppose we join two large relations, R and S, on C and C is an alternate
key (or superkey) of S, but not of R. Suppose $|R[C]| = 1$. Then the communication and
processing times are both $O(\log n \cdot (|S|/n) + |R|)$. This gives no speedup if |R| is large. It is
for cases like these that we consider optimizations to the basic algorithm.

8.  Algorithm Optimizations

We consider three optimizations on the basic algorithm: combining, tagging, and
smearing. These optimizations may be applied independently to the two relations. Thus,
combining may be applied to relation R, whereas tagging is applied to relation S. However,
only tagging and smearing may apply together to a single relation.

To see when to apply these optimizations, we must characterize the distinct join cases.
To do so, we need four propositional variables: CiskeyofR, AinattofR, CiskeyofS, and AinattofS.
CiskeyofR holds when the join columns C are the key that the hash partitioning of R was
based on. AinattofR holds when some of the target columns besides those in C are in R.
That is, AinattofR holds if some attributes in A - C are attributes of R. CiskeyofS and
AinattofS have analogous meanings with respect to S. Here we describe the four cases that
determine how R is processed. The decisions for S are analogous.

CiskeyofR                           use basic algorithm

not CiskeyofR and not AinattofR     semi-join projected on S, use basic algorithm,
                                    combining may help

not CiskeyofR and AinattofR
and AinattofS                       not semi-join case, use basic algorithm
                                    if C U A is not a superkey of R
                                    then combining may help
                                    tagging does not help in general

not CiskeyofR and AinattofR
and not AinattofS                   semi-join projected on R,
                                    use basic algorithm and
                                    if C U A is not a superkey of R,
                                    then combining may help
                                    if |R[C]| << |R[C U A]| then tag
                                    if |R[C]| <= log n and
                                       |S[C U A]| <= n^2/(2|R[C]|)
                                    then smear and tag
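The four cases can be read as a small decision procedure. The sketch below is one possible encoding under explicit assumptions: the function name `choose_strategy` and its parameter names are hypothetical, "<<" is interpreted as "smaller by at least a factor `much_less`" (default 10, an arbitrary choice), the logarithm is taken base 2, and the smearing thresholds are read as upper bounds on |R[C]| and |S[C U A]|.

```python
import math

def choose_strategy(c_is_key_of_r, a_in_att_of_r, a_in_att_of_s,
                    cua_is_superkey_of_r, rc, rca, sca, n, much_less=10):
    """Return the list of techniques suggested for relation R.

    rc  = |R[C]|, rca = |R[C U A]|, sca = |S[C U A]|, n = processors.
    """
    steps = ["basic algorithm"]
    if c_is_key_of_r:
        return steps                      # case 1: nothing to optimize
    if not a_in_att_of_r:
        steps.append("combining may help")  # case 2: semi-join projected on S
        return steps
    if not cua_is_superkey_of_r:
        steps.append("combining may help")  # duplicates of R[C U A] exist
    if a_in_att_of_s:
        return steps                      # case 3: tagging does not help
    # Case 4: semi-join projected on R.
    if rc * much_less <= rca:             # assumed reading of |R[C]| << |R[C U A]|
        steps.append("tag")
    if rc <= math.log2(n) and sca <= n * n / (2 * rc):
        steps.append("smear and tag")
    return steps
```

A call such as `choose_strategy(False, True, False, False, rc=4, rca=4000, sca=1000, n=1024)` walks through the last case and reports every technique whose condition holds.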

Combining  entails  changing  the  network  in  order  to  reduce  the  number  of  duplicate 
tuples  from  R  reaching  each  destination  processor. 


Tagging  changes  the  basic  algorithm  by  projecting  on  the  join  columns  only  instead  of 
on  the  join  and  result  columns.  Tagging  also  adds  a  step  to  return  the  tuples  whose  C  values 
are  included  in  the  join  (see  step  (4)  below)  and  one  more  step  (5)  to  produce  the  result. 

Tagging is not generally useful when AinattofS holds, because then tagging requires that
each processor i send Ri[C] tuples in step (2) and then send Ri[C U A] tuples in step (5) of
the modified algorithm for those C values that participate in the join. We don't expect to be
able to predict how many of the original Ri[C U A] tuples the join eliminates, so we cannot
consider this variant to be useful.

Basic  Algorithm  modified  for  tagging  R 

(For  illustrative  purposes,  we  use  the  standard  operations  from  the 
basic  algorithm  for  S.  What  happens  to  the  S  tuples  doesn't  change 
what  happens  to  the  R  tuples.) 

(1)  At each processor i,

        Ri' := Ri[C]
        Si' := Si[C U A]

(2)  for each t ∈ Ri', send (t,Rtag,i) to h(t[C]).
     for each t ∈ Si', send (t,Stag) to h(t[C]).

(3)  At each processor j,

        join the incoming tuples from R and S.

(4)  for each (t,Rtag,i)

        if t is in the join result then send (t,Rtag,i) back to i.

(5)  if AinattofS

        then {not generally useful case}
           for each returned (t,Rtag,i)
              send all tuples t' ∈ Ri[C U A]
              such that t'[C] = t[C]
              to h(t[C])
        else {semi-join case}
           for each returned (t,Rtag,i)
              put all tuples t' ∈ Ri[C U A]
              such that t'[C] = t[C]
              in join result.
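The semi-join branch of the tagging variant can be sketched as a single-process simulation. This is an illustration under assumptions, not the authors' code: tuples are Python dicts, `hash` stands in for h, S sends only its C values (which is what Si[C U A] amounts to when no S attribute outside C appears in the result), and the name `tagged_semijoin` is hypothetical.

```python
from collections import defaultdict

def tagged_semijoin(R_parts, S_parts, join_cols, n):
    """For each processor i, return the R tuples stored at i whose
    join-column values are matched by some S tuple."""
    def h(key):
        return hash(key) % n  # stand-in for the hash function h

    def proj(t):
        return tuple(t[c] for c in join_cols)

    # Steps (1)-(2): R sends only distinct C values, tagged with origin i.
    r_inbox, s_inbox = defaultdict(set), defaultdict(set)
    for i, part in enumerate(R_parts):
        for key in {proj(t) for t in part}:
            r_inbox[h(key)].add((key, i))
    for part in S_parts:
        for t in part:
            s_inbox[h(proj(t))].add(proj(t))

    # Steps (3)-(4): destination j returns the tagged values that matched.
    returned = defaultdict(set)
    for j in range(n):
        for key, i in r_inbox[j]:
            if key in s_inbox[j]:
                returned[i].add(key)

    # Step (5), semi-join case: origin i emits its full matching tuples.
    return [[t for t in part if proj(t) in returned[i]]
            for i, part in enumerate(R_parts)]
```

Only the short C projections cross the network twice; the wider R tuples never leave their home processor, which is the source of the saving when |R[C]| is much smaller than |R[C U A]|.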

When |R[C]| is small (say |R[C]| << n) and C is not the key of R, a single value from
R[C], say x, will tend to be distributed over all n processors. Tagging, therefore, allows the
possibility that some processor j in step (3) may receive n tagged tuples from R with value x.
To avoid this, smearing modifies the algorithm by copying each S tuple to several processors.
This allows each tagged R[C] value to go to any of these several processors, reducing the
build-up at those processors.


Basic Algorithm modified for tagging and smearing R
(As above, we use the standard operations for S.)

(1)  At each processor i,

        Ri' := Ri[C]
        Si' := Si[C U A]

(2)  for each t ∈ Ri', send (t,Rtag,i) to
        some processor in the range [(h(t[C])-k) mod n .. (h(t[C])+k) mod n]
        (send to each processor in turn, deterministically)
     for each t ∈ Si', send (t,Stag) to
        all processors in [(h(t[C])-k) mod n .. (h(t[C])+k) mod n].

(3)  At each processor j,

        join the incoming tuples from R and S.

(4)  for each (t,Rtag,i)

        if t is in the join result then send (t,Rtag,i) back to i.

(5)  assume not(AinattofS) {semi-join case}

        for each returned (t,Rtag,i)
           put all tuples t' ∈ Ri[C U A]
           such that t'[C] = t[C]
           in join result.
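The destination choice in step (2) of the smeared algorithm can be sketched as follows. The helper names `smear_range` and `r_destination`, and the use of a per-sender counter to realize "each processor in turn", are assumptions made for the illustration; the sketch also assumes 2k + 1 <= n so that the window contains distinct processors.

```python
def smear_range(home, k, n):
    """Processors in [(home - k) mod n .. (home + k) mod n], in order."""
    return [(home + d) % n for d in range(-k, k + 1)]

def r_destination(home, k, n, counter):
    """Deterministic round-robin choice of one processor in the window.

    `counter` is the number of tagged tuples this sender has already
    routed to this window (a hypothetical bookkeeping device).
    """
    window = smear_range(home, k, n)
    return window[counter % len(window)]

# An S tuple hashing to processor 0 with k = 1 on 8 processors is copied
# to every processor in the window; tagged R values rotate through it.
print(smear_range(0, 1, 8))
print([r_destination(0, 1, 8, c) for c in range(4)])
```

Every processor in the window holds a copy of the S tuple, so whichever one receives a tagged R value can decide locally whether it participates in the join.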

9.  Analysis of the Optimizations

The objective of analysis is to produce an algorithm for deciding when to use each of
the possible optimizations. The two parameters we want to minimize are inmax and outmax,
since these are the values that both the communication costs and processing costs depend on.
We start by introducing notation and assumptions that are common to all our analyses. Then
we analyze combining, tagging, and smearing in turn.

9.1.  Notation and Assumptions for the Analysis

Let $R[C] = \{rc_1, \ldots, rc_m\}$, where $|R[C]| = m$. Let $R[C \cup A] = \{rca_{i,j}\}$, where
$rca_{i,j}[C] = rc_i$. Finally, let $v_{i,j}$ be the number of processors containing $rca_{i,j}$.

If all this information were available and if we knew the exact distribution of the rca
values, we could arrive at exact values of inmax and outmax. However, it is infeasible to
obtain this information in a real system. Therefore, we make the following simplifying
assumptions, called uniformity assumptions:

1) The number of distinct $R[C \cup A]$ values whose projection on C is $rc_i$ is the same for all i
and is $r = |R[C \cup A]| / |R[C]|$.

2) Each value in $R[C \cup A]$ is in v processors.

Crude as these assumptions are, they help us decide when to apply the optimizations.²
One assumption that would be too crude would be to assume that the R[C] tuples are
distributed evenly over the destination processors. The reason is that $|R[C]| = m$ could be
small. We define the parameter c to be the maximum number of distinct R[C] values that
hash to a single processor. By the equidistribution lemma, with high probability,
$c \le 2m/n$ if $m > n \log n$; otherwise $c \le \log m$.


² They understate the desirability of using the optimizations. For example, (1) causes the analysis to make
tagging seem less useful than it could be. (2) makes combining seem less useful than it could be.


Basic Algorithm Lemma: Under the uniformity assumptions and assuming $mrv > n \log n$,
[...]

9.2.  Analysis of Combining

A combining network tries to prevent more than one copy of the same tuple from going
to any processor. In the ideal case, as soon as a tuple passes through the network, the
network remembers it and eliminates every other instance of that tuple.

For concreteness, let us say we are going to apply combining to $R[C \cup A]$, whose
cardinality is mr. This will only help if $C \cup A$ is not a superkey of R. In that case, tuples
with the same values on those attributes should be distributed across the processors, because
the partitioning is based on a key of R. According to our uniformity assumption, each
$R[C \cup A]$ value is in v processors. Thus, the total number of tuples that will be sent is mrv.
These tuples are approximately equi-distributed across the processors. Hence we have the
following lemma.

Ideal Combining Lemma: Under the uniformity assumptions above, ideal combining
reduces inmax from crv to cr. []

To see how useful combining is, we should note first that since combining occurs in the
network, combining will not reduce outmax. Combining helps significantly if $crv > outmax$.
Since we can approximate outmax by $mrv/n$ provided mrv is large, combining helps
significantly whenever $c > m/n$.

Unfortunately, the network has no global oracle to eliminate duplicate values, so we
now analyze a "non-ideal" implementation of combining, which approximates existing
implementations. Our model is the following:

(1) Each switch stores the q distinct values that passed through the switch most recently. If a
value enters the switch that is one of those stored, the new value is removed. (It is a
duplicate.) The value q is a design parameter of the switch.

(2) Given that a value passed through a switch at least once in the past, the probability that it
is stored is equal to $q/l_s$, where $l_s$ is the number of distinct values that ever pass through
switch s.

(3) Due to network symmetry, half the distinct values passing through a switch follow each
of the 2 outputs (we assume here 2 by 2 switches).

Fact: Given these assumptions, the probability that a value is removed at a switch of
level i (measured from the destination end, see network description) is the same for all
switches of level i.

According to our assumptions above, at most 2cr distinct values pass through a switch
connected to a destination processor. Hence, the combining probability at the last switch is
$p_1 = \min(1, q/(2cr))$. The combining probability at stage i (measured from the destination) is
$p_i = \min(1, q/(cr \cdot 2^i))$.

A particular value passes through the network without combining with probability
$Pass = (1-p_1)(1-p_2) \cdots (1-p_{\log n})$. Therefore, inmax is reduced from $crv$ to
$cr + (crv - cr)\,Pass$.

Non-ideal Combining Lemma: Let $a$ be the ratio between the maximum number of
distinct $R[C \cup A]$ values arriving at one destination processor, $cr$, and the memory size, $q$. If
the uniformity assumptions hold, then inmax is reduced from $crv$ to $cr + (crv - cr)(1 - 1/a)$. []

Example: If $cr/q = 3$, then $Pass = 2/3$, so $inmax = cr/3 + 2crv/3$.

So, non-ideal combining is as useful as ideal combining only if the size of the memory is
approximately the number of distinct $R[C \cup A]$ values arriving at a single destination
processor.
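The non-ideal model can be evaluated numerically with a short sketch. It assumes that the per-level combining probability is min(1, q/(cr * 2^i)) for levels i = 1 .. log2 n, which is the reading consistent with the last-switch probability q/(2cr) given above; the function names `pass_probability` and `inmax_nonideal` are hypothetical.

```python
def pass_probability(c, r, q, n):
    """Probability that a duplicate crosses all log2(n) switch levels
    without being combined, under the non-ideal model."""
    levels = n.bit_length() - 1          # log2(n) for n a power of 2
    p_pass = 1.0
    for i in range(1, levels + 1):
        p_i = min(1.0, q / (c * r * 2 ** i))  # combining probability, level i
        p_pass *= 1.0 - p_i
    return p_pass

def inmax_nonideal(c, r, v, q, n):
    """inmax after non-ideal combining: cr + (crv - cr) * Pass."""
    cr = c * r
    return cr + (cr * v - cr) * pass_probability(c, r, q, n)

# Ratio cr/q = 3 on a 1024-processor network.
print(pass_probability(c=3, r=1, q=1, n=1024))
```

Since the level probabilities sum to less than q/(cr), the product is never smaller than 1 - q/(cr) whenever every level probability is below 1, so the closed form 1 - 1/a is a convenient approximation rather than an exact value of the product.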

9.3.  Analysis of Tagging

Tagging  reduces  both  outmax  and  inmax  in  its  first  communication  step,  but  requires  an 
extra  communication  step  whose  cost  is  no  more  than  the  cost  of  the  first  one.  (We  consider 
here  the  semi-join  case  when  AinattofS  is  false.  So  there  is  only  one  extra  sending  step.) 

Tagging Lemma: Under the uniformity assumptions above, tagging reduces outmax to
$\min(mrv/n, m)$ and inmax to $c \cdot \min(n, rv)$. []

Example: Suppose $r = |R[C \cup A]| / |R[C]| = 10000$, $v = 10$, and $n = 1000$. Combining gives
$outmax = 100m$ and $inmax = 10000c$. Tagging gives $outmax = m$ and $inmax = 1000c$. However,
note that tagging requires sending values back, whose effect we can approximate by doubling
inmax and outmax. Note also that we get a degradation by a factor of 10 from optimal
speed-up in this case.
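The arithmetic of this example follows directly from the ideal combining lemma and the tagging lemma, as the sketch below shows. The function names are hypothetical, and the values chosen for m and c are arbitrary placeholders, since the example leaves them symbolic.

```python
def combining_bounds(m, c, r, v, n):
    """(outmax, inmax) with ideal combining: outmax ~ mrv/n, inmax = cr."""
    return m * r * v / n, c * r

def tagging_bounds(m, c, r, v, n):
    """(outmax, inmax) from the tagging lemma, before the return trip."""
    return min(m * r * v / n, m), c * min(n, r * v)

# r = 10000, v = 10, n = 1000 as in the example; m and c are placeholders.
m, c = 50, 2
print(combining_bounds(m, c, 10000, 10, 1000))  # outmax = 100m, inmax = 10000c
print(tagging_bounds(m, c, 10000, 10, 1000))    # outmax = m,    inmax = 1000c
```

Doubling the tagging figures to account for the return step still leaves them well below the combining figures for these parameters.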

9.4.  Analysis of Smearing

Intuitively, smearing helps when m is small, causing many values to go to one processor
whereas few go to its neighbors. Since smearing requires that S tuples be copied, $|S[C \cup A]|$
should also be small. In this analysis, we assume that |R| is large enough so every R[C] value
is in every processor, by the pigeoning lemma.

Using tagging alone in that case, $inmax = nc$. Suppose that our smearing parameter is
$k = m/2$. Then for R, $inmax = nm/(2k+1)$. (This is actually conservative, because it supposes
that all R[C] values hash to a single interval of $2k+1$ processors. In the best case,
destination processors are more than k processors apart and inmax decreases to $n/(2k+1)$.)

The communication cost for S however increases. There are two cases: either no tuples
of S would have been sent if R were not smeared, in which case C is a key of S and inmax
and outmax increase to $2k|S[C \cup A]|/n$; otherwise, tuples from S were sent and inmax and
outmax increase by a factor of 2k.

Example: Suppose $|S[C \cup A]| = an$ for some constant a and $|R[C]| = m$. Suppose further
that C is the partition key of S. If $k = m/2$ and m is small, then $inmax = nm/(m+1) + ma$, whereas
by tagging alone $inmax = n(\log m + 1)$. Thus, smearing only helps if $n \log m > 2am$, which is our
decision condition in section 8. This takes into account the fact that outmax increases by ma
using smearing.
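The comparison in this example can be tabulated with a few lines of code. The sketch makes several assumptions that should be read as such: logarithms are taken base 2, the smeared estimate is taken to be nm/(m+1) + ma (all nm tagged values spread over a window of 2k + 1 = m + 1 processors, plus ma copied S tuples), and the function names are hypothetical.

```python
import math

def inmax_tagging_alone(n, m):
    """inmax with tagging only: n * (log m + 1)."""
    return n * (math.log2(m) + 1)

def inmax_smearing(n, m, a):
    """Assumed estimate with k = m/2: nm/(m+1) for R plus ma for S."""
    return n * m / (m + 1) + m * a

def smearing_helps(n, m, a):
    """Decision condition stated in the text: n log m > 2am."""
    return n * math.log2(m) > 2 * a * m

n, m = 1024, 4
for a in (1, 1000):
    print(a, inmax_tagging_alone(n, m), inmax_smearing(n, m, a), smearing_helps(n, m, a))
```

With a small constant a the condition holds comfortably, while a large S relative to n quickly makes the copied S tuples dominate, which matches the conclusion that smearing is only rarely useful.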

10.  Conclusion

We propose and analyze a method for performing joins using symmetric processors
interconnected by a high bandwidth network. Our method gives an optimal performance
speedup within a constant factor for many cases of the join.

Our  analysis  suggests  that  hardware  architectural  features  such  as  combining  can  be 
useful  for  join  processing.  Our  analysis  also  suggests  that  the  tagging  technique  often 
improves performance when the join operation reduces to a semi-join. On the negative side,
our analysis shows that smearing, a technique for which we had great hope, is only rarely
useful.

The main open problems are to study the system experimentally to gain a better
understanding of the constant factors in communication time; to study the implementation
and performance issues concerning the maintenance of data structures and of information
about the distribution of values in non-prime attributes; and to extend this work to a general
query processing strategy.